Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees
Efficient methods for storing and querying are critical for scaling
high-order n-gram language models to large corpora. We propose a language model
based on compressed suffix trees, a representation that is highly compact and
can be easily held in memory, while supporting queries needed in computing
language model probabilities on-the-fly. We present several optimisations which
improve query runtimes up to 2500x, despite only incurring a modest increase in
construction time and memory usage. For large corpora and high Markov orders,
our method is highly competitive with the state-of-the-art KenLM package. It
imposes much lower memory requirements, often by orders of magnitude, and has
runtimes that are either similar (for training) or comparable (for querying).
Comment: 14 pages, in Transactions of the Association for Computational
Linguistics (TACL) 201
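As a rough illustration of the on-the-fly queries involved, the sketch below
computes n-gram probabilities from raw corpus counts. It is a minimal sketch,
not the paper's method: a naive scan stands in for the compressed suffix tree,
and "stupid backoff" stands in for modified Kneser-Ney smoothing; all names
are illustrative.

```python
from typing import List, Tuple

def count(corpus: List[str], pattern: Tuple[str, ...]) -> int:
    """Occurrences of `pattern` as a contiguous token sequence in `corpus`.
    A (compressed) suffix tree answers this without the linear scan below."""
    m = len(pattern)
    return sum(1 for i in range(len(corpus) - m + 1)
               if tuple(corpus[i:i + m]) == pattern)

def backoff_prob(corpus: List[str], context: Tuple[str, ...], word: str,
                 alpha: float = 0.4) -> float:
    """Stupid-backoff score: relative frequency if the n-gram was observed,
    otherwise alpha times the score under a shortened context."""
    denom = count(corpus, context) if context else len(corpus)
    full = count(corpus, context + (word,))
    if full > 0:
        return full / denom
    if not context:                       # unigram base case: unseen word
        return 0.0
    return alpha * backoff_prob(corpus, context[1:], word, alpha)

corpus = "the cat sat on the mat the cat ran".split()
print(backoff_prob(corpus, ("the",), "cat"))       # 2/3: 'the cat' 2x, 'the' 3x
print(backoff_prob(corpus, ("on", "the"), "mat"))  # 1.0
```

Every probability reduces to a handful of pattern-count queries, which is why
a compact structure with fast counting suffices to serve the model on the fly.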
Feature Combination for Measuring Sentence Similarity
Sentence similarity is one of the core elements of
Natural Language Processing (NLP) tasks such as
Recognizing Textual Entailment and Paraphrase Recognition.
Over the years, different systems have been proposed to
measure similarity between fragments of text. In this
research, we propose a new two-phase supervised learning
method which uses a combination of lexical features to
train a model for predicting similarity between sentences.
Each of these features covers an aspect of the text at an
implicit or explicit level. The two-phase method trains a
separate model on every combination of features in the
feature space, then uses the predictions of these models as
a meta-feature space on which a final model is trained.
This contrasts with existing approaches based on feature
selection, as our method does not aim to find the single
best subset of the possible features. We show that this
two-step process significantly improves on the results
achieved by a standard single-layer learning methodology,
and reaches performance comparable to existing
state-of-the-art methods.
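A minimal sketch of the two-phase scheme, assuming a standard stacking setup:
phase one trains one model per non-empty feature combination, phase two trains
a final model on their stacked out-of-fold predictions. Ridge regression and
random data are stand-ins for the thesis's lexical features and learners.

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                             # 4 toy features
y = X @ rng.normal(size=4) + 0.1 * rng.normal(size=200)   # similarity scores

# Phase 1: one model per non-empty feature combination; out-of-fold
# predictions keep training labels from leaking into the meta-features.
subsets = [c for r in range(1, X.shape[1] + 1)
           for c in combinations(range(X.shape[1]), r)]
meta = np.column_stack([cross_val_predict(Ridge(), X[:, list(cols)], y, cv=5)
                        for cols in subsets])

# Phase 2: the final model is trained on the meta-feature space.
final_model = Ridge().fit(meta, y)
print(meta.shape)   # (200, 15): 2**4 - 1 subset models
```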
Investigating Pre-trained Audio Encoders in the Low-Resource Condition
Pre-trained speech encoders have been central to pushing state-of-the-art
results across various speech understanding and generation tasks. Nonetheless,
the capabilities of these encoders in low-resource settings are yet to be
thoroughly explored. To address this, we conduct a comprehensive set of
experiments using a representative set of 3 state-of-the-art encoders
(Wav2vec2, WavLM, Whisper) in the low-resource setting across 7 speech
understanding and generation tasks. We provide various quantitative and
qualitative analyses on task performance, convergence speed, and
representational properties of the encoders. We observe a connection between
the pre-training protocols of these encoders and the way in which they capture
information in their internal layers. In particular, we observe that the
Whisper encoder exhibits the greatest low-resource capabilities on
content-driven tasks, in terms of both performance and convergence speed.
Comment: INTERSPEECH 202
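For a concrete picture of the layer-wise analysis, the sketch below pulls
per-layer representations out of one of the three encoders (Wav2vec2) via
Hugging Face transformers on synthetic audio; probes would then be fit on
each layer. The checkpoint and input are placeholders, and none of the
paper's actual tasks or metrics are reproduced here.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

name = "facebook/wav2vec2-base"
extractor = AutoFeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

waveform = torch.randn(16000)                  # 1 s of fake 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One (batch, frames, dim) tensor per layer (plus the layer-0 input); a
# probing classifier would be trained on each representation in turn.
for i, h in enumerate(out.hidden_states):
    print(f"layer {i}: {tuple(h.shape)}")
```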
Plug-and-Play Recipe Generation with Content Planning
Recent pre-trained language models have shown promising capabilities in
generating fluent and realistic natural language text. However, generating
multi-sentence text with global content planning has been a long-standing
research question. Current approaches for controlled text generation can hardly
address this issue, as they usually condition on single known control
attributes. In this study, we propose a low-cost yet effective framework which
explicitly models the global content plan of the generated text. Specifically,
it optimizes the joint distribution of the natural language sequence and the
global content plan in a plug-and-play manner. We conduct extensive experiments
on the well-established Recipe1M+ benchmark. Both automatic and human
evaluations verify that our model achieves state-of-the-art performance on
the task of recipe generation.
Comment: Paper accepted by the EMNLP 2022 GEM workshop
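A minimal sketch of the plug-and-play idea, under the simplifying assumption
that whole candidates are reranked under the joint objective
log p_lm(x) + lam * log p_plan(c | x) rather than guided token by token. Both
scorers below are hypothetical stand-ins, not the paper's models.

```python
import math
from typing import Callable, List

def joint_rerank(candidates: List[str],
                 lm_logprob: Callable[[str], float],
                 plan_logprob: Callable[[str], float],
                 lam: float = 1.0) -> str:
    """Return the candidate maximising the plug-and-play joint score."""
    return max(candidates, key=lambda x: lm_logprob(x) + lam * plan_logprob(x))

# Toy stand-ins: fluency prefers shorter text; the content plan for this
# recipe step calls for boiling.
lm_score = lambda x: -0.1 * len(x.split())
plan_score = lambda x: math.log(0.9 if "boil" in x.lower() else 0.1)

candidates = ["Fry the onions until golden.",
              "Boil the pasta until al dente."]
print(joint_rerank(candidates, lm_score, plan_score))  # picks the 'boil' step
```

The frozen LM supplies fluency while the separate plan model steers content,
which is what makes the control "plug-and-play": neither model is retrained.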
Koala: An Index for Quantifying Overlaps with Pre-training Corpora
In recent years, increasing attention has been paid to probing the role of
pre-training data in the downstream behaviour of Large Language Models
(LLMs). Despite its importance, there is no public tool that supports such
analysis of pre-training corpora at large scale. To help research in this
space, we launch Koala, a searchable index over large pre-training corpora
built on compressed suffix arrays, which offer a highly efficient
compression rate together with fast search support. In its first release,
we index the publicly available portion of the OPT 175B pre-training data.
Koala provides a framework for forensic analysis of current and future
benchmarks, as well as for assessing the degree of memorization in the
outputs of LLMs. Koala is available for public use at
https://koala-index.erc.monash.edu/.
Comment: Available here: https://koala-index.erc.monash.edu
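The toy (uncompressed) suffix array below illustrates the core query such an
index answers: how often does a string occur in the corpus, found by binary
search over sorted suffixes. Koala's compressed suffix arrays serve the same
query at corpus scale; the naive build here is for illustration only.

```python
# Requires Python 3.10+ for bisect's `key=` argument.
from bisect import bisect_left, bisect_right

def build_suffix_array(text: str) -> list:
    """Start positions of all suffixes, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text: str, sa: list, query: str) -> int:
    """Occurrence count via binary search over the sorted suffixes."""
    prefix = lambda i: text[sa[i]:sa[i] + len(query)]
    lo = bisect_left(range(len(sa)), query, key=prefix)
    hi = bisect_right(range(len(sa)), query, key=prefix)
    return hi - lo

text = "to be or not to be"
sa = build_suffix_array(text)
print(count_occurrences(text, sa, "to be"))  # 2
print(count_occurrences(text, sa, "maybe"))  # 0
```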
FireAct: Toward Language Agent Fine-tuning
Recent efforts have augmented language models (LMs) with external tools or
environments, leading to the development of language agents that can reason and
act. However, most of these agents rely on few-shot prompting techniques with
off-the-shelf LMs. In this paper, we investigate and argue for the overlooked
direction of fine-tuning LMs to obtain language agents. Using a setup of
question answering (QA) with a Google search API, we explore a variety of base
LMs, prompting methods, fine-tuning data, and QA tasks, and find language
agents are consistently improved after fine-tuning their backbone LMs. For
example, fine-tuning Llama2-7B with 500 agent trajectories generated by GPT-4
leads to a 77% HotpotQA performance increase. Furthermore, we propose FireAct,
a novel approach to fine-tuning LMs with trajectories from multiple tasks and
prompting methods, and show that more diverse fine-tuning data can further
improve agents. Along with other findings regarding scaling effects,
robustness, generalization, efficiency and cost, our work establishes
comprehensive benefits of fine-tuning LMs for agents, and provides an initial
set of experimental designs, insights, as well as open questions toward
language agent fine-tuning.
Comment: Code, data, and models are available at
https://fireact-agent.github.i
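As a rough sketch of what fine-tuning data for such agents looks like, the
snippet below converts a ReAct-style trajectory (thought/action/observation
steps ending in an answer) into one prompt/completion training record. The
trajectory and record layout are illustrative assumptions, not FireAct's
actual format.

```python
import json

trajectory = {
    "question": "In which U.S. state did Alan Turing complete his PhD?",
    "steps": [
        {"thought": "I should search for Alan Turing's PhD.",
         "action": "search[Alan Turing PhD]",
         "observation": "Turing obtained his PhD from Princeton University."},
        {"thought": "Princeton University is in New Jersey.",
         "action": "finish[New Jersey]",
         "observation": ""},
    ],
}

def to_example(traj: dict) -> dict:
    """Flatten the steps into the completion the LM is fine-tuned to produce."""
    completion = "".join(
        f"Thought: {s['thought']}\nAction: {s['action']}\n"
        + (f"Observation: {s['observation']}\n" if s["observation"] else "")
        for s in traj["steps"])
    return {"prompt": f"Question: {traj['question']}\n", "completion": completion}

print(json.dumps(to_example(trajectory)))  # one JSONL line per trajectory
```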